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The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial 
importance for understanding and predicting regulatory elements in gene networks. TFBS motifs 
are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair inde- 
pendently contributes to the transcription factor (TF) binding, despite mounting evidence of inter- 
dependence between base pairs positions. The recent availability of genome- wide data on TF-bound 
DNA regions offers the possibility to revisit this question in detail for TF binding in vivo. Here, 
we use available fly and mouse ChlPseq data, and show that the independent model generally does 
not reproduce the observed statistics of TFBS, generalizing previous observations. We further show 
that TFBS description and predictability can be systematically improved by taking into account 
pairwise correlations in the TFBS via the principle of maximum entropy. The resulting pairwise 
interaction model is formally equivalent to the disordered Potts models of statistical mechanics and 
it generalizes previous approaches to interdependent positions. Its structure allows for co- variation 
of two or more base pairs, as well as secondary motifs. Although models consisting of mixtures of 
PWMs also have this last feature, we show that pairwise interaction models outperform them. The 
significant pairwise interactions are found to be sparse and found dominantly between consecutive 
base pairs. Finally, the use of a pairwise interaction model for the identification of TFBSs is shown 
to give significantly different predictions than a model based on independent positions. 



AUTHOR SUMMARY 

Transcription factors are proteins that bind on DNA 
to regulate several processes such as gene transcription or 
epigenetic modifications. Being able to predict the Tran- 
scription Factor Binding Sites (TFBSs) with accuracy on 
a genome- wide scale is one of the challenges of modern bi- 
ology, as it allows for the bottom-up reconstruction of the 
gene regulatory networks. The description of the TFBSs 
has been to date mostly limited to a simple model, where 
the affinity of the protein for DNA, or binding energy, is 
the sum of independent contributions from uncorrelated 
amino-acids bound on base pairs. However, structural 
aspects are of prime importance in proteins and could 
imply appreciable correlations throughout the observed 
binding sequences. Using a statistical physics inspired 
description and high-throughput ChlPseq data for a va- 
riety of Drosophilae and mammals TFs, we show that 
such correlations exist and that accounting for their con- 
tribution greatly improves the predictability of genomic 
TFBSs. 



INTRODUCTION 

Gene regulatory networks are at the basis of our un- 
derstanding of a cell state and of the dynamics of its 
response to environmental cues. Central effectors of this 
regulation are Transcription Factors (TF) that bind on 
short DNA regulatory sequences and interact with the 
transcription apparatus or with histone-modifying pro- 
teins to alter target gene expressions [1 . The determina- 
tion of Transcription Factor Binding Sites (TFBSs) on a 
genome-wide scale is thus of importance and is the focus 



of many current experiments 0. An important feature 
of TF in eukaryotes is that their binding specificity is 
moderate and that a given TF is found to bind a vari- 
ety of different sequences in vivo [3]. The collection of 
binding sequences for a TF-DNA is widely described by 
a Position Weight Matrix (PWM) which simply gives the 
probability that a particular base pair stands at a given 
position in the TFBS. The PWM provides a full statis- 
tical description of the TFBS collection when there are 
no correlations between nucleotides at different positions. 
Provided that the TF concentration is far from satura- 
tion, the PWM description applies exactly at thermody- 
namic equilibrium in the simple case where the different 
nucleotides in the TFBS contribute independently to the 
TF-DNA interaction, such that the total binding energy 
is the mere sum of the individual contributions [4, 5j. 

Previous works have reported several cases of corre- 
lations between nucleotides at different positions in TF- 
BSs A systematic in vitro study of 104 TFs us- 
ing DNA microarrays revealed a rich picture of binding 
patterns [10], including the existence of multiple motifs, 
strong nucleotide position interdependence, and variable 
spacer motifs, where two small determining regions of the 
binding site are separated by a variable number of base 
pairs. Recently, the specificity of several hundred hu- 
man and mouse DNA-binding domains was investigated 
using high-throughput SELEX. Correlations between nu- 
cleotides were found to be widespread among TFBSs and 
predominantly located between adjacent flanking bases 
in the TFBS [9 . The relevance of nucleotide correlations 
remains however debated pT] . 

On the modeling side, probabilistic models have been 
proposed to describe these correlations, either by explic- 
itly identifying mutually exclusive groups of co-varying 
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nucleotide positions [71 [T2j [13] , or by assuming a specific 
and tractable probabilistic structure such as Bayesian 
networks or Markov chains [9l[T4l[T5]. However, the ex- 
tent of nucleotide correlations in TFBSs in vivo remains 
to be assessed, and a systematic and general framework 
that accounts for the the rich landscape of observed TF 
binding behaviours is yet to be applied in this context. 
The recent breakthrough in the experimental acquisition 
of precise, genome-wide TF-bound DNA regions with 
the ChlPseq technology offers the opportunity to address 
these two important issues. Using a variety of ChlPseq 
experiments coming both from fly and mouse, we first 
show that the independent model generally does not re- 
produce well the observed TFBS statistics for a majority 
of TF. This calls for a refinement of the PWM description 
that accounts for interdependence between nucleotide po- 
sitions. 

The general problem of devising interaction parame- 
ters from observed state frequencies has been recently 
studied in different contexts where large amounts of data 
have become available. These include describing the 
probability of coinciding spikes jT6[ [TT or activation se- 
quences [m [19] in neural data, the statistics of protein 
sequences [20l [21] , and even the flight directions of birds 
in large flocks [22]. Maximum-entropy models account- 
ing for pairwise correlations in the least constrained way 
have been found to provide significant improvement over 
independent models. The PWM description of TF bind- 
ing is equivalent to the maximum entropy solely con- 
strained by nucleotide frequencies at each position. Thus, 
we propose, in the present paper, to refine this model 
by further constraining pairwise correlations between nu- 
cleotide positions. This corresponds to including effective 
pairwise interactions between nucleotides in an equilib- 
rium thermodynamic model of TF-DNA interaction, as 
already proposed ^23j. When enough data are available, 
the TFBS statistics and predictability are found to be sig- 
nificantly improved in this refined model. We consider, 
for comparison, a model that describes the statistics of 
TFBSs as a statistical mixture of PWMs [M] and gener- 
alizes previous proposals [25] . This alternative model 
can directly capture some higher-order correlations be- 
tween nucleotides but is found to be outperformed for all 
considered TF by the pairwise interaction model. 

We further show that the pairwise interaction model 
accounts for the different PWMs appearing in the mix- 
ture model by studying its energy landscape: each basin 
of attraction of a metastable energy minimum in the pair- 
wise interaction model is generally dominantly described 
by one PWM in the mixture model. Significant pairwise 
interactions between nucleotides are sparse and found 
dominantly between consecutive nucleotides, in general 
qualitative agreement with in vitro binding results |9]. 
The proposed model with pairwise interactions only re- 
quires a modest computational effort. When enough data 
are available, it should thus generally prove worth using 
the refined description of TFBS that it affords. 



RESULTS 

The PWM model does not reproduce the TFBS 
statistics 

We first tested how well the usual PWM model re- 
produced the observed TFBS statistics, i.e. how well 
the frequencies of different TFBSs were retrieved by us- 
ing only single nucleotide frequencies. For this purpose, 
we used a collection of ChlPseq data available from the 
literature [26ti28] . both from D. Melanogaster and from 
mouse embryonic stem cells (ESC) and a myogenic cell 
line (C2C12). The TFBSs are short L-mers (we take 
here L = 12), which are determined in each few hundred 
nucleotides long ChlP-bound region with the help of a 
model of TF binding. One important consequence and 
specific features of these data, is that the TFBS collec- 
tion is not independent of the model used to describe it. 
Thus, in order to self-consistently determine the collec- 
tion of binding sites for a given TF from a collection of 
ChlPseq sequences, we iteratively refined the PWM to- 
gether with the collection of TFBSs in the ChlPseq data 
(see Figure [] and Methods). This process ensured that 
the frequency of different nucleotides at a given position 
in the considered ensemble of binding sites was exactly 
accounted by the PWM. We then enquired whether the 
probability of the different binding sequences in the col- 
lection agreed with that predicted by the PWM, as would 
be the case if the probabilities of observing nucleotides 
at different positions were independent. Figure [2] dis- 
plays the results for three different TFs, one from each of 
the three considered categories: Twi (Drosophila), Esrrb 
(mammals, ESC), and MyoD (mammals, C2C12). For 
each factor, the ten most frequent sequences in the TFBS 
collection are shown. For comparison. Figure [2] also dis- 
plays the probabilities for these sequences as predicted by 
the PWM built from the TFBS collection. The indepen- 
dent PWM model strongly underestimates the probabili- 
ties of the most frequent sequences. Moreover, the PWM 
model does not correctly predict the frequency order of 
the sequences and attributes comparable probabilities to 
these different sequences, in contrast to their observed 
frequencies. 

The relative entropy or Kullback-Leibler divergence 
(DKL) is a general way to measure the difference be- 
tween two probability distributions p9] . In order to bet- 
ter quantify the differences between the observed binding 
sequence frequencies and the PWM frequencies, we com- 
puted the DKL between these distributions for all the 
considered TF, as shown in Figure |2|l). For each tran- 
scription factor T, part of the differences comes from the 
finite number N(T) of its observed binding sites. The 
results are thus compared for each factor T to DKLs be- 
tween the PWM probabilities and frequencies obtained 
for artificial sequence samples of size N(T) generated 
with the same PWM probabilities. For most TFs (22 
out of 28), the difference between the observed binding 
sequence frequencies and the PWM frequencies is signifi- 
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FIG. 1. Workflow. An initial Position Weight Matrix (PWM) is used to find a set of binding sites on CfilPseq data. Models 
are then learned using single-point frequencies (independent), two-point correlations (pairwise) or a mixture of independent 
models learned on sites clustered by K-Means (mixture) with increasing complexity, i.e. increasing number of features in the 
model. Finally the models with best Bayesian Information Criteria (BIC) are used to predict new sites until convergence to a 
stable set of sites. 



cantly larger than expected from finite size sampling. In 
the following we focus on these 22 factors for which the 
PWM description of the TFBSs needs to be be refined. 
It can be noted that the 6 factors for which the PWM 
description appears satisfactory are predominantly those 
for which the smallest number of ChIP sequences is avail- 



able (see Table 1 and Figure SI). 



Pairwise interactions in the binding energy improve 
the TFBS description 

The discrepancy between the observed statistics of TF- 
BSs and the statistics predicted by the PWM model 
calls for a re-evaluation of the PWM main hypothesis, 
namely the independence of bound nucleotides. As re- 
called above, the inverse problem of devising interac- 
tion parameters from observed frequencies of "words" 
has been recently studied in different contexts. It has 
been proposed to include systematically pairwise corre- 
lations between the "letters" comprising the words to re- 
fine the independent letter description. In the case of 
a two- letter alphabet, the obtained model is equivalent 
to the classical Ising model of statistical mechanics [30 . 
In the present case, the 4-nucleotide alphabet (A,C,G,T) 
leads to a model equivalent to the so-called inhomoge- 
neous Potts model [30] (hereafter called pairwise inter- 
action model), a generalization of the Ising model to the 
case where spins assume q values and their fields and in- 
teraction parameters depend on the sites considered. In 
this analogy, nucleotides are spins with q = 4 colors. 

In practice, the probability of observing a given word 



{si...sl) in the dataset is expressed as P[si...sl] = 
(l/Z) exp(— H[5i...5l]), where Z is a normalization con- 
stant. 1-L is formally equivalent to a Hamiltonian in the 
language of statistical mechanics, and reads: 



n[si...SL] = -^hi{Si) - ^^Ji,j{Si,Sj), 



i=l i=l j<i 

Si e {A,C,G,T} 



(1) 



The "magnetic fields" hi at each site z, along with the 
interaction parameters Jij between nucleotides at posi- 
tions i and j, are computed so as to reproduce the fre- 
quency of nucleotide usage at each position in the TFBS 
as well as the pairwise correlations between nucleotides at 
different positions (see Methods). In principle, the num- 
ber of parameters in the model is sufficient to reproduce 
the observed values of all pairwise correlations between 
nucleotides. This however would result in over-fitting 
the finite-size data with an irrealistically large number 
of parameters. Therefore, to obtain the model parame- 
ters we instead maximized the likelihood that the data 
was generated by the model with a penalty proportional 
to the numbers of parameters involved, as provided by 
the Bayesian Information Criterion (BIC) [31]. Similarly 
to the procedure followed for the PWM, the pairwise in- 
teraction model and the collection of TFBSs for a given 
factor were iteratively refined together, as schematized in 
Figure [j 

Figure |3] shows the improvement in the description of 
TFBS statistics when using the final pairwise interac- 
tion model, for the three factors chosen for illustrative 
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FIG. 2. Observed TFBS frequencies are poorly predicted by a PWM model. The observed frequencies of the 
most represented binding site sequences for the TF Twist (A), Esrrb (B) and MyoD (C) are shown (gray bars) as weh as the 
probabihties of these sequences as predicted by the PWM model (blue bars). (D) Kullback-Leibler Divergence (DKL) between 
the observed probability distribution and the independent model distribution (blue). As a control we show the mean (cyan 
bars) along with two standard deviations of the DKL between the independent model and a finite sample drawn from it (see 
Methods). A discrepancy between the observed and predicted sequence probabilities is reported for 22 out of 28 factors. 



purposes. Where the independent model failed at re- 
producing the strong amplitude and non-linear decrease 
in the frequencies of the most over-represented TFBSs, 
the pairwise interaction model provides a substantial im- 
provement in reproducing the observed statistics. The 
improvement is most apparent when comparing the fre- 
quencies of the ten most observed TFBSs between the 
model and the ChlPseq data (Figure [s] A, C, E), and is 
further shown by the statistics of the full collection of 
TFBSs (Figure ill B, D, F). 



The pairwise model ranks binding sites differently 
from the PWM 

Precise predictions of TFBSs are one important out- 
put of ChlPseq data. Moreover, they condition further 
validation experiments such as gel mobility shift assays 
or mutageneses. We therefore found it worth assessing 
the difference in TFBS predictions between pairwise and 
independent models. 

First, we compared the set of ChIP sequences retrieved 
by the independent and pairwise models model at the 
cutoff of 50% TPR (True Positive Rate) used in the learn- 
ing scheme, as shown in Figure |4]A. The non overlapping 
set of ChlPseq sequences {i.e. sequences that were picked 
by one model but not by the other) was found to range 



5 



Twist 



TGCGTGTGTGTG 

TACATATGTATG " 

TACATATGTACA 

TGCATGTGTGTG 

TACATATGTATA " 

CACACATGCACA 

CACATGTGTGTG " 

CGCGTGTGTGTG 

TGCGCGTGTGTG 

CACACATGTGTG 




□ Data 

■ Independent 

□ Mixture 

■ Pairwise 



0.00 0.01 



0.02 0.03 0.04 
Probability 



Esrrb 



CCCAAGGTCACC i 
CTCAAGGTCACC 
GTCAAGGTCAAG 
CTCAAGGTCAAG 
CCCAAGGTCACA 
TTCAAGGTCACC 
CTCAAGGTCACA 
CCCAAGGTCAGG 
TCCAAGGTCACC 
CCCAAGGTCAAG 




CT 



0.004 0.008 
Probability 

MyoD 



CTGCAGCTGCTG 
CAGCAGCTGCTG 
CAGCAGCTGCTC 
CTGCAGCTGTCC 
GAGCAGCTGCTG 
AGGCAGCTGCTG 
CAGCAGCTGCCT 
CCACAGCTGCTG 
CAGCAGCTGCCA 
TGGCAGCTGCTG 




0.012 



0.000 



0.005 0.010 
Probability 



0.015 




-5 -4 -3 -2 -1 
Observed probability (loglO) 




-4 -3 -2 -1 

Observed probability (loglO) 



FIG. 3. Models with correlations improve TFBS statistics prediction. The observed frequencies (gray bars) of the 
most represented TFBSs for Twist (A), Esrrb (B) and MyoD (C) TPs, are shown together with the probabihties of these 
sequences predicted by the independent energy model (blue bars) , the pairwise model taking into account interactions between 
nucleotides (red bars), and the K-means mixture model (green bars). (B,D,F) show the comparison between frequencies for 
all binding sequences and predicted sequence probabilities for the three models (same color code). The probability predictions 
of the pairwise model and to a lesser extent of the mixture model are in much better agreement with the observed frequencies 
than those of the PWM model. 



from a few percent for TF like Esrrb, up to about 15 
% for Twist. Thus, even when stemming from the same 
ChlPseq data, the two models can be learnt from signif- 
icantly distinct set of sites. 

Second, using the set of ChlPseq peaks on which the 
pairwise model was learned, we looked for the best pre- 



dicted sites on each ChlPseq bound fragment using both 
the pairwise and PWM models (Figure [4^). 

The overlap was found to be about 80% on average. 
The overlap between the sets comprising the two best 
TFBSs of each ChlPseq was also computed. This re- 
sulted in an overlap increase or decrease between the 
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FIG. 4. Overlap between predicted sites. (A) Venn diagrams showing the overlap between the ChIP predicted by the 
independent (blue) and pairwise (red) models. (B) Difference (one minus the proportion of shared sites) between the best sites 
predicted by pairwise and PWM models on ChlPseq peaks (light red), and the same quantity when including the next best 
predicted sites on each peak (dark red). In several cases (e.g. FosU, Max, n-Myc, Srf, Stat3, Usfl), the difference between 
predicted sites is much smaller when the two best sites are considered, indicating that the pairwise model and the PWM model 
rank differently the two best sites in ChIP peaks with multiple bound sites. 



prediction of the two models depending on the average of 
number of binding sites per retrieved ChlPseq fragment. 
In a few cases {e.g CTCF, Esrrb), the inclusion of the 
second best TFBS increased the difference between the 
two models. This generally happened when the ChlPseq 
fragments were retrieved with typically a single TFBS 
above threshold {e.g. for Essrb the TFBS specificity was 
fixed to retrieve 50% of 18453 ChlPseq and about 11000 
fragments where found by the two models — see Table I). 
In these cases, the low specificity TFBSs tended to dif- 
fer more between the two models than the very specific 
ones. In several other cases {e.g for FosU, Max, n-Myc, 
USFl), the inclusion of the second best predicted binding 
sites (Figure [4^) greatly increased the overlap between 
the two model predictions. This corresponded to cases 
for which the retrieved fragments contained on average 
two of more TFBSs about the specificity threshold (Ta- 
ble I). This showed that for these cases the prediction 
difference between the two models arose predominantly 
from a different ranking of the best TFBSs. 

In conclusion, the TFBS predictions made by the 
two models can differ significantly both in the rank of 
ChlPseq fragments and in the rank of binding sites on 
these fragments. 



Comparison with a PWM- mixture model 

When described by a PWM, the binding energies of 
a TF for different nucleotides sequences form a simple 
energy well with a single minimum at a preferred con- 
sensus sequence. Some authors have instead analyzed 
the binding specificity of transcription factors by intro- 
ducing multiple preferred sequences ^24, 25 . A model of 
this type that naturally generalizes the PWM description 
consists of using multiple PWMs [II]. We found it in- 
teresting to investigate this approach based on a mixture 
of PWMs and compare it with the pairwise interaction 
model to get some insights into potentially important 
high-order correlations that would not be captured by 
the pairwise model. As precisely described in Methods, 
an initial mixture of K PWMs was generated by grouping 
into K clusters the TFBS data for a given TF. Similarly 
to the pairwise interactions, the number of clusters K 
was constrained, to avoid over-fitting, by penalizing the 
corresponding model score using the BIC. For a given 
TF, the PWM mixture and the collection of TFBSs in 
the ChlPSeq data were refined iteratively until conver- 
gence, usually reached after 10 iterations. The results are 
shown in Figure [5}^ for the three representative factors. 
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FIG. 5. Model selection. (A) Minimisation of the Bayesian information criterion (BIC, see Methods) is used to select the 
optimal number of model parameters and avoid over- fitting the training set. The evolution of the BIC is shown for the pairwise 
model (crosses) and the PWM-mixture model (lines). Colors from dark blue to red indicate the number of interations (see 
Figfll. 

(B) Kullback-Leibler divergences (DKL) between the independent, K-means and pairwise distributions and the observed dis- 
tribution for the different TFs, for the BIC optimal parameters. We also show the DKL of the pairwise model with a finite-size 
sample of sequences drawn from it (pink, see Methods). Error bars represent two standard deviations. 



Twi, Esrrb and MyoD. 

The best description of Twi ChlPSeq data is, for in- 
stance, provided by a mixture of 5 PWMs, which cor- 
responds to 184 independent parameters. The mixture 
model yields a significant improvement when compared 
to the single-PWM model for Twi, and milder ones for 
Essrb and MyoD. In the three cases however, it proves 
inferior to the pairwise model. 

More generally. Figure [5^ shows the performances 
of the different models for all studied TFs using the 
Kullback-Leibler Divergence or DKL between the data 
distribution P{s) and the models distributions Pm{s). 
On the one hand, the mixture model improves the de- 



scription of the binding data for 12 out of 27 TFs as 
compared to the single PWM model. The mixture model 
gives in particular strong improvements in the cases for 
which the binding sites have a palindromic structure (eg 
Twi, MyoD, Myog, Max, USFl). This feature often 
stems from the fact that the TF binds DNA as a dimer, 
which could give some concreteness to the mixture model: 
the recruitment of different partners by bHLH factors like 
MyoD or Myog could indeed lead to a mixture of TFs 
binding the same sites. On the other hand, the pairwise 
model clearly outperforms the other models in all cases 
studied. 

As in the PWM case, the finite size of the datasets 
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leads us to expect fluctuations in the estimation of the 
DKL. In order to assess the magnitude of these flnite- 
size fluctuations, we computed the average DKL between 
the best-fltting (pairwise) model and a flnite-size artifl- 
cial sample drawn from its own distribution, as shown in 
Figure |5^. Values of this DKL that are larger than the 
one obtained with the real dataset are indicative of over- 
fltting, while the opposite case would suggest that the 
model is incomplete. In all cases, however, the DKL ob- 
tained with this control procedure was within error bars 
of the value computed with respect to the observed sam- 
ple, with the exception of NRSF, MyoD, and Myog, as 
seen in Figure [5^. Thus, the pairwise model is generally 
the best possible model, insofar as the available dataset 
allows us to probe. 



The metastable states of the pairwise interaction 
model 



In order to more directly relate the pairwise interac- 
tion and the mixture models, it is useful to consider the 
energy landscape of the pairwise interaction model in the 
space of all possible TFBSs. By contrast with the sim- 
ple, single-minimum energy well of the PWM model, the 
pairwise interaction model has multiple metastable en- 
ergy minima. The energy landscape of the pairwise in- 
teraction model can thus be seen as a collection of energy 
wells, each centered on its metastable energy minimum. 
The span of the different energy wells in sequence space 
can be precisely deflned as the basins of attraction of the 
different metastable minima in an energy minimizing pro- 
cedure (see Methods). This allows one to associate each 
observed TFBS to a particular energy minimum. This 
deflnes basins of attraction that are used to build repre- 
sentative PWMs for each metastable minimum together 
with a weight — the number of sequences in the basin 
of attraction — for this energy minimum. We compared 
each metastable minimum to the PWMs of the mixture 
model, by calculating the DKL between the PWM com- 
puted from the sequences in its basin of attraction and 
the PWMs of the mixture model. This gave an effective 
distance which allowed us to associate each metastable 
state to the nearest PWM of the mixture model. 

Using this procedure, we computed the set of PWMs 
and weights corresponding to the 27 considered TF pair- 
wise interaction models. Examples are shown in Figure 
[6j In the case of Twi, the PWMs of the pairwise model 
("metastable PWMs") can be clearly associated to the 
K = 5 PWMs of the mixture model. For MyoD, three 
of the 5 "metastable PWMs" can be clearly assigned to 
PWMs of the mixture model. The other two have a 
more spread out representation. The case of Esrrb is 
similar with one "metastable PWM" in clear correspon- 
dence with one PWM of the mixture model, and the other 
one less clearly so. The correspondence between the two 



representation allows one to identify some features cap- 
tured by the pairwise model. For example, in the case 
of Twist, most of the correlations are coming from the 
two nucleotides at the center of the motif, which take 
mainly 3 values among the 16 possible: CA,TG and TA. 
In the case of MyoD, the representation makes apparent 
the interdependencies between the two nucleotides fol- 
lowing the core E-Box motif, and the restriction to the 
three main cases of CT, TC and TT. 



Properties of the pairwise interactions 

The computation of the interaction parameters allows 
an analysis of some of their properties. In particular, 
it is interesting to quantify their strengths and measure 
the typical distance between interacting nucleotides. We 
address these two questions in turn. 

The concept of Direct Information was previously in- 
troduced to predict contacts between residues from large- 
scale correlation data in protein families [33 . We used it 
here to measure the strength of the pairwise interaction 
between two nucleotides. Using the previously gener- 
ated interaction parameters from the pairwise model, we 
built the Normalized Direct Information (NDI), a quan- 
tity which varies from for non-existing interactions, 
to 1 when interactions are so strong that knowing the 
amino acid identity at one position entirely determines 
the amino acid identity at the other position (see Meth- 
ods). Heatmaps displaying the results for the representa- 
tive Twist, Esrrb and MyoD factors are shown in Figure 
[7|and in Figures [S3| for the other factors. An important 
observation is that the direct information between dif- 
ferent nucleotides is rather weak — usually smaller than 
10% — but substantially larger than the direct interac- 
tion between nucleotides in the surrounding background 
(1-3%, see Figure 



S4). It is interesting to note that 



models is shown in Figure S2 for the other TFs for which 
the mixture model uses more than a single PWM. This 



such weak pairwise interactions give rise to a substan- 
tial improvement in the description of TFBS statistics, 
similarly to what was previously found in a different con- 
text \W . The pairwise interactions are furthermore ob- 
served in Figure [7| to be concentrated on a small subset of 
all possible interactions. This can be made quantitative 
by computing the Participation Ratio of the interaction 
weights, an indicator of the fraction of pairwise interac- 
tions that accounts for the observed Direct Information 
(see Methods). Here, typical values of 10 — 20% were 
found (Figure [t] and Table [ij), showing that the interac- 
tions tend to be concentrated on a few nucleotide pairs. 

The interaction weights can also be used to measure 
the typical distance between interacting nucleotides. To 
that purpose, we computed the relative weight of the Di- 
rect Information as a function of the distance between 
nucleotides (see Methods). Figure [s] A shows box plots 
that summarize the results for the considered Drosophi- 
lae and mammalian TFs. Both plots show a clear bias 
towards nearest-neighbor interactions with a strong peak 
at d = 1, and a rapid decrease for d > 2. Finally, the 
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FIG. 6. Metastable states. The DNA sequence variety described by each model is illustrated using weblogos [32]. Shown are 
PWMs built from all sites, from the PWM-mixture model, and from the basins of attraction of the pairwise interaction model 
for Twist (A), Esrrb (B), and MyoD (C). The metastable PWMs are grouped under the mixture PWMs with smallest distance 
(measured by DKL, in bits). Heatmaps showing the DKLs between metastable PWMs and mixture PWMs are displayed on 
the right for each factor (minimal DKLs are in black). The proportions of sites used for each logo are also indicated and serve 
to denote the corresponding PWM. 



dominant pair interactions are on average located in the 
flanking regions of the BS in clear anti-correlation with 
the most informative nucleotides which are on average in 
the central region (Figure [s] B). These observations for 
TF binding in vivo agree with similar ones made from 
a large recent analysis of TF binding in vitro [9^. The 



fact that for pair correlations to be important, nucleotide 
variation at a given location is required, may be one way 
to rationalize them. 
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FIG. 7. Nucleotide pair interactions. Heat maps showing the values of the Normalized Direct Information between pairs 
of nucleotides. The matrix is symmetric by definition. PWMs are shown on the side for better visualization of the interacting 
nucleotides. The participation ratio R is indicated below each heat map. 



TABLE I. Participation Ratios 



Name 


Part. Ratio 


Bin 


0.11 


Mef2 


0.19 


Twi 


0.28 


E2fl 


0.13 


Esrrb 


0.11 


Klf4 


0.16 


Nanog 


0.10 


n-Myc 


0.09 


Oct4 


0.24 


Sox2 


0.12 


Tcfcp211 


0.12 


Zfx 


0.10 


CEBPB 


0.05 


CTCF 


0.23 


E2f4 


0.14 


FosU 


0.09 


Max 


0.18 


MyoD 


0.09 


Myog 


0.09 


NRSF 


0.27 


TCF3 


0.19 


USFl 


0.07 



Alternative representation of interactions by 
Hopfield patterns 

Using a simple binary description of neurons, JJ Hop- 
field suggested, in a classic piece of work [34 , that neural 
memories could be attractors corresponding to patterns 
arising from pair interactions between neurons. These in- 
teraction patterns can be computed in the present case. 
They offer an alternative way to analyze the patterns of 



correlation from the pair-interactions between positions, 
as already proposed in a mean-field context in [35 . Be- 
cause the matrix of interactions Jij is symmetric, it can 
be diagonalized in an orthonormal basis of eigenvectors 
the Hopfield patterns in the present case, with corre- 
sponding real eigenvalues A/^. These orthonormal eigen- 
vectors correspond to the Hopfield patterns in the present 
case. The Potts energy (Eq. ([T])) for a binding sequence 
si ■ ■ ■ sl can be rewritten in terms of the Hopfield pat- 
terns as (see Methods): 

4L / L \ 2 

n = -^h,{s,)--^xJ^^^{s,)\ . (2) 

i k=l / 

Although here the presence of the diagonal h term pre- 
vents the patterns to be metastable energy states, they 
can still be useful to analyze the interaction matrix. This 
spectral decomposition of the interaction matrix is also 
similar in spirit to a principal component analysis, and 
even equivalent in the case of Gaussian variable. One 
can thus wonder how many patterns are needed to well 
approximate the full matrix of interactions J. To ad- 
dress this question, one can rank the eigenvalues Xk in 
order of decreasing moduli and note Jp the restriction of 
the interaction matrix generated by the first p eigenval- 
ues and their associated patterns. The full interaction 
matrix naturally corresponds to J^s- Approximate inter- 
action matrices obtained by keeping different numbers 
of dominant patterns are shown in Figure |9] for the three 
considered representative factors. Pairs of successive pat- 
terns appear to provide the main interaction domains in 
this representation, as is particularly clear in the case of 
MyoD. One can see in Figure [9] that Jq already closely 
approximates the full interaction matrix, a reflection in 
the present representation, that the important interac- 
tions are concentrated on a few links between pairs of 
nucleotides. 
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FIG. 8. Properties of the pair interactions. (A) Distances between interacting nucleotides. The box plots show the 
relative importance of the Normalized Direct Information as a function of the distance between interacting nucleotides. Red 
dots denote average values. (B) Sum of normalized direct informations in the TFBSs at a given position, averaged over all 
considered factors (blue line). The average site information content relative to background as a function of position is also 
shown (red line). In both quantities, the average over the two TFBS orientations has been taken. 



DISCUSSION 

The availability of ChlPseq data for many TFs has 
led us to revisit the question of nucleotide correlations in 
TFBSs. In order to perform a fully consistent analysis of 
this type of data, we have developed a workflow in which 
the TFBS collection and the model describing them are 
simultaneously obtained and refined together. We have 
found that when a sufficiently large number of TFBSs is 
available, the PWM description does not account well for 
their statistics. The general presence of correlations that 
follows from this finding, agrees with previous reports 
for particular transcription factors [6l El [24] and with 
the conclusions of large scale in vitro TF binding studies 

In order to refine the PWM description, we have an- 
alyzed a model with pairwise interactions [23], and a 
PWM mixture model [14 . Data overfitting is a concern 



for multi-parameter models and has been addressed by 
putting a penalty on the parameter number using the 
BIG. While the mixture-model improved in some cases 
the PWM description, especially for palindromic binding 
sites, a much more significant and general improvement 
was found with the pairwise interaction model. The suc- 
cess of the pairwise interaction model agrees with the 
results of its recent application (however, without the 
BIG) to high-throughput in vitro binding data [23 . It 
moreover shows that, at least in the case we considered, 
pairwise interactions are sufficient to account for higher- 
order correlations, and that an explicit description like 
the one provided by the PWM-mixture model is not nec- 
essary. For example, for Essrb, metastable states arising 
from nearest-neighbor interactions reproduce a triplet of 
flanking nucleotides with a variable spacer from the core 
motif (Figure S5). 

Our detailed analysis of the obtained interaction mod- 
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FIG. 9. Representation of interactions by Hopfield patterns. The full interaction matrix J is approximated by a 
matrix Jp built from the p Hopfield patterns with highest eigenvalue moduli. We show J2, J4, Jq and the full matrix J in 
the basis (i, h) with i — {1, 12} and h ={A,C,G,T}. Color bars are shown on the first row for each factor. For MyoD, the 
correspondence between successive pairs of patterns and distinct interaction domains is seen particularly clearly. In all cases 
the full interaction matrix is already well approximated by Jq. 



els for different TPs shows that the weights of pairwise 
interactions are generally weak. The most important are 
only about 10 % of the PWM weights, but significantly 
above the interaction weights in the surrounding back- 
ground DNA (of the order 1-3% by the same measure). 
Nonetheless, collectively these interactions significantly 
improve the model description of the TF binding data as 
found in other examples [16^ . 



We have here obtained the pairwise interaction models 
based on the principle of maximum entropy, constrained 
to account for the pair-correlations measured in the data. 
This approach has already been followed in a variety of 
biological contexts, from populations of spiking neurons 
[ini [17] to protein sequences [20] to bird flocks [22] . An 
interesting feature of these interaction models is their 
no n- convexity, which allows for the existence of many lo- 
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cal maxima in the probability distribution of sequences, 
or local minima of energy. This was noted for repertoires 
of antibodies in a single individual [21], where many of 
these local states were observed and suggested as pos- 
sible signatures of past infections. In a very different 
context, local probability maxima in the probability dis- 
tribution of retinal spiking patterns was reported and 
linked to error-correcting properties of the visual system 
[36] . In the present case of TFBSs, these local mimima 
reflect the multiplicity of binding solutions and resemble 
the individual PWMs of the mixture model. Pairwise 
interaction models thus somehow incorporate models of 
multiple PWMs while outperforming them. 

The previously considered case of protein sequences 
shares many similarities to the statistics of TFBSs, since 
correlations in protein sequences as in TFBSs reflect both 
structural and functional contraints. In proteins families, 
correlations are usually interpreted as resulting from the 
co-evolution of residues interacting with each other in 
the protein structure. These effects are hard to distin- 
guish from phylogenic correlations or other observational 
biases. Nonetheless, the inference of interaction models 
from data was successfully used to predict physical con- 
tacts between amino-acids in the tertiary structure [37 , 
and to aid molecular dynamics simulations in predicting 
protein structure [SS'-W. In the case of TFBSs, compar- 
ison between in vitro[9, 10 and in vivo binding data may 
help to disentangle the different possible origins of the 
found correlations and seems worth pursuing. It appears 
similarly interesting to study how much of the found pair 
correlations can be explained on the basis of structural 
data. Finally, the role of nucleotide interaction in TFBS 
evolution [41 should be considered and could improve 
the reconstruction of TFBSs from multi-species compar- 
ison [42ti44j. 

Independently of these future prospects, we have found 
that the TFBSs predicted from ChlP-seq data signif- 
icantly depended on the model used to extract them. 
Since the pairwise interaction model and the developed 
workflow signiflcantly improve TFBS description and re- 
quire a modest computational effort, they should prove 
worthy tools in future data analyses. 



MATERIALS AND METHODS 

Genome-wide data retrieval 

We use both ChlP-on-chip data from Drosophila 
Melanogaster and ChlPseq data from Mus Musculus. 
Data was retrieved from the litterature [26] [27] and 
from ENCODE data accessible through the UCSC web- 
site http : //hgdownload. cse .ucsc . edu/goldenPath/ 
inin9/encodeDCC/wgEncodeCaltechTfbs/, for a total of 
27 TFs. Among them, there are 5 developmental 
Drosophilae TFs: Bap, Bin, Mef2, Tin and Twi, 11 mam- 
mahan stem cells TFs: c-Myc, E2fl, Esrrb, Klf4, Nanog, 
n-Myc, Oct4, Sox2, Stat3, Tcfcp211, Zfx, and 11 factors 



involved in mammalian myogenesis: Cebpb, E2f4, FosU, 
Max, MyoD, Myog, Nrsf, Smadl, Srf, Tcf3, Usfl. Over- 
all, there are between 678 and 38292 ChIP peaks, with 
average size 280bp. DNA sequences were masked for re- 
peats using RepeatMasker [45^ . 



Background models 

It is important to discriminate the statistics of the mo- 
tifs proper from that of the background DNA on which 
motifs are found. Besides particular nucleotides frequen- 
cies, the background DNA can exhibit signiflcant nu- 
cleotide correlations, for instance arisin g fro m CpG de- 
pletion in mammalian genomes (Figure 



S4). For each 



ChlPseq data, we used, as background, all sites from both 
strands of the sequences. This serves to learn indepen- 
dent and pairwise background models which were used as 
reference models to score the corresponding TFBS mod- 
els. The position information content in all plotted PWM 
logos is measured with respect to the nucleotide back- 
ground frequencies {i.e. the independent background 
model) 



Initial PWM refinement 

Along with the ChlPseq data for the different factors, 
we also retrieved corresponding PWMs from the litera- 
ture [26^ or from TRANSFAC database [46]. These initial 
PWMs were reflned according to the following protocol. 

Given ChlPseq data (bound regions) for a given TF 
and an initial PWM of length L (L = 12 was taken for 
all computations in the present paper), we scanned both 
strands of each bound region and attributed to all ob- 
served L-mers a score deflned as the ratio between the 
PWM and background models probabilities. A cutoff was 
set such that half of the bound regions had at least one 
predicted TFBS with a score above the cutoff, setting 
a True Positive Rate (TPR) of 50%. This heuristic cri- 
terium overcame the problem of False Positives among 
the ChlPseq peaks that might have polluted the data. 
This deflned a training set of N L-mers with probability 
higher than the cutoff. Bound sites were again predicted 
using the same cutoff. This procedure was repeated un- 
til stabilization of the predicted sites to a flxed subset. 
This resulted in a reflned PWM with its associated set 
of bound sites. 



Independent model evaluation 

The independent model consist of a matrix of single nu- 
cleotide probabilities of size 4 x L, where L is the width of 
the binding site. In a flrst approximation, the parameters 
appearing in the matrix can be estimated from a set of 
binding sites by computing the observed frequency /^^^ of 
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TABLE II. Information about the TFs 



Name 




^^inde- mixture 


^^inde- pair wise 


mixture- pair wise 




A/mixture 


A^pairwise 


Bap 


678 





12 


12 


2205 


2208 


2117 


Bin 


1857 


2 


80 


81 


1300 


1298 


1228 


Mef2 


4545 





161 


161 


3681 


3681 


3665 


Tin 


1791 





40 
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1333 


1333 


1310 


Twi 


3211 
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128 
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Oct4 


3147 





74 


74 


3187 


3187 


3079 


Smadl 


907 





24 


24 


690 


690 


667 


Sox2 


3523 





95 


95 


2306 


2306 


2293 


STATS 


2099 


54 


58 


62 


2308 


2264 


2231 


Tcfcp211 


22406 





418 


418 


16691 


16691 


16649 


Zfx 


9152 





203 


203 


6473 


6473 


6473 


CEBPB 




QQQ 




oo^ 




8322 


S9(S7 
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1 70^7 
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J- I UcyO 


J- I uuu 


E2f4 


4132 


248 
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"^1 4(S 




FosU 


Uc'O J- 
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tJUOO 


uuoo 


OKJOcf 


Max 


O I OJ- 


24 


70 


O J- 


12531 


12495 




MvoD 


SSQ6Q 


717 


679 


665 


25416 


25430 


25344 


Myog 


38292 


1116 


584 


835 


29520 


29334 


29647 


NRSF 


13756 


639 


672 


488 


13183 


14363 


13440 


SRF 


2370 


1 


34 


35 


2929 


2928 


2948 


TCF3 


9453 


185 


277 


257 


8528 


8690 


8775 


USFl 


8956 


11 


14 


12 


8628 


8619 


8625 



For each TF, we show the number A'chip of ChIP sequences retrieved, the numbers Ainde-pairwise, Ainde-mixture, and 
Apairwise-mixture of different ChIP sequences used for training between either two models, and the numbers Mnde, Admixture, 
and A^pairwise of TFBSs used to learn each model. 



nucleotide h at position i. However, this frequency fluc- 
tuates around the "true" probability due to finite sam- 
ple size, and for example unobserved nucleotides could 
actually have a low probability of being observed pro- 
vided that the number of observations be high enough. 
It is usual to correct for this effect by using the Bayesian 
pseudo-count approach stemming from Laplace's rule of 
succession [3] . The probability to observe nucleotide h at 
position i is given by: 



(3) 



where Ui^b is the number of observed b at position z, N 
is the total number of sites, and a^'s are the pseudo- 
counts, or prior probabilities to observe nucleotide b at 
position i. The pseudo-counts were all set to 1, however 
no significant effect was noted when changing this value. 



as expected from the large number of observations. 



Kullback-Leibler divergence 

The Kullback-Leibler divergence is a measure of dis- 
tance between two probability distributions p and (7 of a 
variable 5, and is defined as: 



DKL(p||5) = ^p(s)log 



(4) 



Throughout this paper, when a DKL is calculated be- 
tween a finite sample and a model distribution, p corre- 
sponds to the sites frequencies in the sample, and q to 
the model distribution. When the DKL is calculated be- 
tween a PWM of a basin of attraction of a metastable 
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state and a PWM from the mixture model, p is used for 
the former, and q for the latter. 



Estimation of the fluctuations due to flnite 
sampling: DKL vs self 

To estimate whether the description of the data by a 
model {e.g. independent or pairwise) could be improved 
or was consistent with the finite number N of observed 
sequences, we computed the 'self DKL between the dis- 
tribution of a set of N sequences drawn from the model 
distribution and the model distribution itself. This pro- 
cedure was repeated 100 times. TFs for which the in- 
dependent model DKL was smaller than or within two 
standard deviations of the self DKL were discarded for 
later analysis. 



Derivation of the pairwise interaction model 

Information theory offers a principled way to deter- 
mine the probabilities of a set of states given some mea- 
surable constraints. It consists in maximizing a func- 
tional known as the entropy [47l |48] over the set of pos- 
sible probability distributions given the imposed con- 
straints. Here, we wish to determine the probability P{s) 
of a DNA sequence s of length L, in the set of TFBSs for a 
transcription factor, given the constraints that the prob- 
ability distribution P retrieves the one- and two-point 
correlations observed in a set of bound DNA sequences. 
We denote by A the alphabet of possible nucleotides, 
A = {A, C, G, T} and by Si the nucleotide at position 
i in the sequence s so that s = si---sL' With these 
notations, the entropy with the considered constraints 
translates into the the following functional: 



C = -^P{s)\nP{s)^xi^P{s)-l \ -^^^h,{a) i^P{s)6{s,,a) - P,{a) 

{s} \{s} J i=laeA \{s} 



L-1 



(5) 



EEE E JiA(^,a')\^Pia)5{s„a)5{s,,s')-Pi,,{a,a') | , 

i=l j>i aeAa'eA KW} 



where Pi{a) (resp. Pij{a^ a')) is the probability of having 
nucleotide a at position i (resp. nucleotides a and a' at 
position i and j) in the TFBS data set. The function S 
denotes the Kronecker (5— function defined by ^(a, a') = 1 
if a = a', and otherwise. The first term in Eq. ([5| is the 
entropy of the probability distribution to be found and 
the other terms are the given constraints along with their 
Lagrangian multipliers. Maximization of the functional 
C is performed in a usual way by setting the functional 
derivative with respect to the probability distribution P 
to zero: 



SjC 
6P{s) 



L L-1 

= - In P(s)-1+A+^ hi{si)^Yl ^^■) 

2=1 i=l j>i 

(6) 



Finally, using the constraint X]{s} -^i^) ~ finds 
the probability distribution that maximizes entropy given 
the constraints that it reproduces the observed one- and 
two-point correlations: 



The normalization constant Z is the partition function, 
Z = ^exp[-H(.)]. (9) 

{s} 



Gauge flxing 

The probability distribution of sequences, as given by 
Eqs. ([7| [8|, is invariant under shifts of the local fields 
hi{a) and under transformations between the interaction 
terms Jij{a^ a') and the local fields. In order to uniquely 
determine this arbitrariness needs to be taken care 
of by adding further conditions that uniquely fix the in- 
teraction parameters, a process known as gauge fixation 
[20] that we detail here. 

a. Local fields. Since it amounts to changing the 
reference energy and is cancelled by the normalization, 
the probability is invariant with respect to the following 
global shift of the hi{a) 



PM=exp[-?/(8)]/Z, 



(7) 



hi{si) hi{si) = hi{si) + £i. 



(10) 



where 1-L{s) is the inhomogeneous Potts model Hamilto- 
nian, 

L L 
n[si...SL] =-Yhi{Si) - ^^Jij{Si,Sj), 

i=i i=i j<i 

8, G{A,C,G,T}. 



We choose to fix this invariance by minimizing the square 
norm Si = ^^eA^^^^^' local field terms with re- 
spect to the gauge degree of freedom. The corresponding 
gauge-fixing condition reads 



0. 



(11) 



aeA 
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This condition can be imposed on any set of fields hi 
by using the symmetry (10) and redefining the fields as 
follows, 



hi{si) hi{si) 



(12) 



aeA 



b. Interaction terms. Another invariance stems 
from the fact that contributions can be shifted between 
local fields and interaction energies. Namely, the follow- 
ing change of variables does not affect the probability: 

(13) 

since the local fields i/ji and can be redistributed in h 
and the constant Cij gives an energy reference for the 
interacting nucleotides that is cancelled by the normal- 
ization process. Again, a gauge condition is obtained by 
minimizing the square norm Sij = a'G^I'^^i^^' ^')]^ 
of interaction terms with respect to the gauge degrees of 
freedom. This yields the conditions: 

^ 4,(a,aO = Yl kji^^^') = 0- (14) 

aeA a'eA 

These can be imposed on a set a of Jjj parameters by 
redefining them as follows: 



a.a'eA 



(15) 



aeA 



aeA 



Determination of the pairwise interaction model 
from the data 



The parameters of the inhomogeneous Potts model in 
Eq. (H, giving the energy of an observed sequence of 
length L, must be computed from the data. The pa- 
rameters h and J represent the energy contributions re- 
spectively coming from individual nucleotides and from 
their interactions. The PWM model is the particu- 
lar case where all the interaction parameters vanish: 
Jij{a,a') = 0. 

To build the model, we start from the PWM descrip- 
tion, characterized by the set of initial hi{a) = log Pi^a 
and the interaction parameters J's set to zero. We add 
one interaction parameter Jij{a^a') at a time, corre- 
sponding to the pair of nucleotides whose pairwise dis- 
tribution predicted by the model differs most from data, 
as estimated by a binomial p- value. We then fit the aug- 
mented model to data, use this model to select a new set 
binding sites from the reads, and repeat the whole pro- 
cedure. In each of these steps, fitting is performed by a 
gradient descent algorithm: 



h ^ h- 



.data 



^model] 
^2 J 

model] 



-1 



J ' 



(16) 
(17) 



where ci and C2 are matrices of size 4 x L and 4L x 4L 
respectively corresponding to the single- and two-point 
frequencies, and superscripts denote whether the matri- 
ces are computed from the data or from the model dis- 
tribution. This algorithm converges to the set of param- 
eters {{hi}^Jij) that match all single marginals and the 
pairwise marginals of interest. The number of interac- 
tion parameters that are being added is controlled by 
the Bayesian Information Criterion, or BIG (Figure [5|. 
The BIG computes the opposite log-likelihood and adds 
a penalty proportional to the number of parameters in- 
volved. This adverts the over-fitting of a finite dataset 
with an extravagant number of parameters. The pro- 
cedure is iterated until minimization of the BIG, yield- 
ing the best pairwise model with the full set of parame- 
ters {{hi{a)}^ {Jij{a^a^)}). As in the case of the PWM 
model, we score each sequence using the ratio between 
the TF and background pairwise models and impose a 
score cutoff so as to select a set of bound sites yielding 
50% TPR, on which a new pairwise model is learned. 
This process is iterated until convergence to a stable set 
of bound sites. 



BIC computation 

Gonsider a sample X = (Xi,...,XAr) of N TF- 
BSs drawn from an unknown distribution function / 
we wish to estimate. To this extent, several models 
{Mi,...,M^} are proposed, each model Mi having a 
density with parameter Oi of dimension Ki. It is 
straightforward to see that, as Ki increases, the fit to 
the observed sample as measured by the likelihood func- 
tion gMi{X\Oi) increases as well, the limiting case being 
when / is estimated as the sample distribution. However, 
such an estimator is inappropriate to account for new, yet 
unobserved TFBSs, i.e. it is not predictive. Such a case 
where the number of parameters used to estimate a dis- 
tribution becomes of the order of the size of the sample is 
known as overfitting. The BIG allows to overcome over- 
fitting by penalizing high dimension parameters. Using 
Bayes Rule, and a uniform a priori distribution on the 
models, we have 



P{M,\X)(xP{X\M,). 



(18) 



That is, the probability of the model given the data can 
be inferred from the probability that the data is gener- 
ated by the model. The latter is obtained by marginaliz- 
ing the joint distribution of the data and the parameters 
over the space of parameters O: 



P{X\M,)= / P{X,0\M,)dO= / gMAX\0)P{0\M,)dO. 

(19) 

For a unidimensional parameter ^, the likelihood 
gMi{X\0) is maximized at some particular Oi with an un- 
certainty (or width) proportional to 1/\/N in the limit 
of large N . Assuming a broad prior, then for large N 
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the integral is dominated by the hkehhood which is con- 
centrated around its maximum. One can then approx- 
imate the integral by the area of the region of height 
the maximum likelihood and of width l/v^TV, that is 
gMi {X^ 6i) I \fN . This result can be retrieved analytically 
using the method of steepest descent. For a number Ki 
of parameters, one gets a total volume gMi{X^ 9i)/N^^^'^ 
[31j . Taking the logarithm yields the BIG condition: 

BIC, = -21og(P(X|M,)) ~ -2\og{gMAXji))+K,\og{N) 

(20) 

In the present case, the sample X is the set of observed 
TFBSs and the model Mi determines the probability 
PMi{s) of belonging to X, 

logigM.iXji)) = Yl log[^M.(^?.)((^)]- (21) 

sex 



The interpretation of Eq. (20) is clear: adding new 



parameters improves the fit, but also adds new sources 
of uncertainty about these parameters due to the finite 
size of the data. This uncertainty disappears as ^ oo, 
since the log-likelihood scales with N while the correction 
scales with log(7V). 

Finally, Eq. (20) is a functional over models, the chosen 



model Mbic is the one that minimizes it. 



Mbic = argmin BICi. 

Mi 



(22) 



PWM mixture model 

We investigated an approach based on a mixture of 
PWMs. For that purpose, we used a comparable setup 
as for the pairwise model. However, instead of adding 
correlations to a given PWM, new PWMs were added to 
a mixture model. More precisely, a mixture of K PWMs, 
with 1 < if < 10, was generated by using a K-means al- 
gorithm with a Hamming distance metrics on the initial 
set of bound sites. This resulted in K clusters, each com- 
prising rik sites among the initial N sites. A PWM was 
generated on each of these clusters, with probability dis- 
tribution Vk- The mixture model of order K was then 
defined as [3T1: 



K 



(23) 



where pk = n^/N is the cluster weight. Because a PWM 
has 3 X L degrees of freedom {L of them being constrained 
by the summation of nucleotide probabilities to one) and 
there are K — 1 free weight parameters, the number of 
parameters corresponding to a mixture of order K is 
3LK + {K — 1). As previously, the model showing mini- 
mal BIG score was used for sites detection, a new set of 
PWMs and weights pk was generated by clustering the 
set of detected sites and the procedure was iterated until 
convergence to a stable set of sites. 



Metastable minima of the pairwise interaction 
model and their basins of attractions 



We defined the basins of attraction of a pairwise inter- 
action model energy landscape, in the following fashion. 
Let s be a site with energy 1-L{s). We looked for the nu- 
cleotides that could be changed to minimize 1-L{s). If such 
nucleotides existed, one of them was chosen at random, 
and its value was updated. One local minimum of the en- 
ergy landscape, or metastable state, was reached when no 
such nucleotide could be found. The basin of attraction 
of a metastable state was then defined as the ensemble 
of sites that fell to this metastable state when their en- 
ergy was minimized following the above procedure. We 
computed metastable states and their basins of attrac- 
tion for the final set of bound sites obtained with the 
best pairwise model. A PWM was learned on each basin 
of attraction, leading to a set of representative PWMs, 
with different weights representing different proportions 
of bound sites in their basins. 



Computation of the Direct Information 

We wanted to build a quantity based solely on direct 
interactions Jij between nucleotides, discarding indirect 
interactions. To this end, we used the interaction pa- 
rameters obtained from the pairwise model to build the 
direct dinucleotide probability function: 

PfAa.a') = e^^(^)+^^(^')+^^'^-^«'^')/Zi,^-, (24) 



where 



a, a' 



The 8 effective fields hi and hj were fully determined 
by the constraints that the direct probability function 
matches the observed one-point frequencies: 



^i^^.(a,aO = P^{a), a' e {A,C,G,T}, 

a' 

Y,Pt^^{a,a')=Pj{a'), ae{A,C,G,T}. 



(25) 



The normalization of the probabilities ^^Pi{o) = 1, 
served to reduce this system to 6 equations. The fields 
hi{a)^ which are determined up to a constant, were fixed 
by the gauge condition that they vanished for the nu- 
cleotide A, h{A) = 0. The system was solved using the 
Levenberg-Marquadt algorithm with A = 0.005. 
The Direct Information |33 was then defined as: 



a, a' \ \ / J \ / 



(26) 
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As there is no upper bound for this direct information, 
we built a normahzed version of the direct information: 



NDL 



DL 



(27) 



where Si denotes the entropy at position i. Note that 
Si = Dli i so that NDIii = 1 for this maximally cor- 
related case. On the contrary, independent nucleotides 
give NDIij = Dlij = 0. 



Participation Ratio 

For each TF, an interaction weight was defined for each 
pair of nucleotides as 



Wi 



= ndi,jY.ndi,^j. 



(28) 



Self-interactions have no meaning here and were at- 
tributed Wi^i = 0. Let us note N = L{L — 1) the number 
of possible interactions. Using our weight, one writes the 
Participation Ratio as: 



R 



1 



(29) 



w. 



The interpretation is simple: if all weights are equal, 
1/N and = 1, that is all possible interactions are 



represented. Conversely, if only one interaction accounts 
in the distribution budget, then R = l/N, meaning that 
only one of all possible interactions is represented. 



Distance between interactions 

The previously defined interaction weights were aver- 
aged over all possible pairs of nucleotides at a given dis- 
tance d of one another, yielding the distance distribution: 



P{\^-J\=d)=Z-'J^ ^ (30) 



\i-j\=d 



where 



N-l 



N-d ^ '-^ 

= 1 \i-j\=d 



(31) 



is a normalization factor. Note that we introduced a cor- 
rection 1/{N — d) to account for finite-size effects, namely 
the fact that randomly distributed interactions will lead 
to an overrepresentation of nearest neighbours interac- 
tions just because these are more numerous. 

Interaction matrix and Hopfield patterns 

In the Hamiltonian shown in ([T]), only 16L(L — l)/2 
terms appear in the interaction budget: indeed, we forbid 
self-interations (already accounted for by the local field h) 
and do not count the interactions twice. However, we can 
straightforwardly extend the interaction matrix to a full 
symmetric matrix J(i,o),(j,6) of size (4L)^, with 4L- valued 
indices (^,a),z G {!,••• •,L}^a G A. The matrix J is 
such that for i > j, J(^i^a){j,b) = ^) with furthermore 

-^(i,a),(i,5) = and J(i,a),(j,5) = -^(j,6),(i,a)- The energy of 
a sequence s can then be written with these notations 

^ L L 

Jij{si,Sj) = ::YYj{i,s.),{j,s,) =v{syjv{s), 



= 1 i = i 



(32) 

where in the last equality the "f" sign denotes vector trans- 
position and we have introduced the 4L vector v{s) asso- 
ciated to sequence 5, v{s)i^a = 1 if a = and v{s)i^a = 
otherwise. 

Since the matrix J is symmetric, it can be diagonalized 
in an orthonormal basis of eigenvectors /c = 1, • • • , L 
with real eigenvalues A/e, 



J 



k 



fet 



(33) 



Denoting by ^ the coordinates of the k-th eigenvector 



then, one can rewrite Eq. (|32|) as 

Ji,j {^ij Sj) 



E 

l<i<j<L 



ifl^^lh^ ■ (34) 
k=i \i=i / 



Finally, the full Hamiltonian is given by: 



n 



i k=l \i=l / 



(35) 
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FIG. SI. Dependence of the fit on the number of ChIP sequences. For each TF, the number of available ChIP sequences 
is plotted vs. the improvement in the description of its TFBS statistics, provided by the he pairwise model as compared to the 
PWM independent model. The latter is quantified by the ratio of DKL between the respective model probability distributions 
and the experimental ones provided by the ChIP data, DKLpw/DKLinde- The improvement afforded by the pairwise model is 
clearly seen to be correlated to the number of ChIP sequences available. 
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FIG. S2. Same as Figure [6] of the main text for all considered factors described by a mixture model with two or more PWMs. 



27 

Bin CEBPB CTCF 




e2f1 E2f4 FosM 




28 



■I) 



NRSF 

?A s^A_g^ 



Oct4 






0.045 


li 1 




0.04 


^o 




0.035 


H 




0.03 






0.025 






0.02 


— 1 


■0.015 


OH 


I0.01 




I0.005 


M 


■0 





Sox2 



0.08 
0.07 
0.06 
0.05 
0.04 
0.03 
0.02 
0.01 




TCF3 

III 



Tcfcp2l1 



USF1 




Zfx 



o 
o 



i 



0.06 
0.05 
0.04 
0.03 
0.02 
0.01 




FIG. S3. Same as Figure [7| of the main text for the other considered factors. 
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FIG. S4. Background correlations (A,B,C) Heat maps showing the correlations between nucleotides in the ChIP data of 
the 3 factors from the main text. Because of translation invariance, we only show the correlations between a nucleotide (rows) 
and the next nearest (first four columns) to farthest (last four columns) nucleotides, using the binding site length of L = 12. We 
see in the Drosophila data the appreciable presence of repeated sequences (of type AA, TT, CC, and GG). In the mammalian 
data sets, we observe the known CpG depletion. (A',B',C') Heat maps showing the values of the Normalized Direct Information 
between pairs of nucleotides. 
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FIG. S5. Variable spacer length We learned a pairwise model for Esrrb including the 4 flanking nucleotides on the left of 
the main motif. (A) The metastable states of this model show a feature not captured in the main text where binding sites are 
defined symmetrically around the center of mass of the information content: namely a 'GAG' trinucleotide with variable spacer 
length from the main motif. This feature is apparent in the first 3 logos shown here. (B) The contribution of this trinucleotidic 
interaction to the Direct Information is captured through strong direct links between the 4 flanking nucleotides, showing that 
the pairwise model is implicitly able to capture higher order correlations. Logos from the PWM model are surrounding the 
heatmap for clarity. 



